An Exploratory Data and Network Analysis of Movies
Introduction
In this report we will be analysing a dataset from Kaggle, which contains movies of different genres produced over many decades. What makes this analysis interesting is that we can try to draw various conclusions based on a movie’s popularity, the directors or actors involved, the year of production, and so forth. Moreover, we can construct various networks in an attempt to find meaningful and interesting results. When inspecting a database of films from recent years, various interesting inferences are uncovered. A film may have a high rating yet a low return on investment (ROI). Which genre would you guess is the most successful? Which actors do you think are the most popular?
We have split our Exploratory Data Analysis into four main parts:
| Section | Description |
|---|---|
| 1 | Introducing the Data - We first try to understand the data and look at its content. |
| 2 | Pre-Processing - We look at what needs to be altered or removed from the dataset. - We try to clean any dirty text. - We try to minimise the dataset’s missing values. |
| 3 | Exploring the Data - We conduct basic analysis on the dataset. - We explore genres. - We explore movie popularity. - We look at profit, gross, and return on investment for movies. - We conduct more advanced analysis on the dataset. |
| 4 | Network Analysis - We measure the network (centrality, degree distribution, number of components, average degree). - We use network measures to highlight certain nodes (actors) and examine which actor-level measures are associated with higher ratings and budgets. |
Admin
Before we start, let’s keep this code chunk for importing the correct libraries and loading the appropriate dataset. We use pacman to load the following:
We import the dataset like this:
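The loading chunk itself is missing from this render. Below is a plausible reconstruction, not the original: the package list is inferred from the functions used later in the report (`aggr` from VIM, `str_split` from stringr, `gather` from tidyr, `formattable`, `geom_text_repel` from ggrepel, `ggarrange` from ggpubr, `ggplotly` from plotly), and the file name `movie_metadata.csv` is assumed from the Kaggle dataset name.

```r
# Reconstruction -- package list inferred from functions used later in the report
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, dplyr, tidyr, stringr, VIM, formattable,
               ggrepel, ggpubr, plotly)

# Assumed file name; stringsAsFactors = FALSE keeps text columns as
# character vectors, matching the str() output shown below
movie_metadata <- read.csv("movie_metadata.csv", stringsAsFactors = FALSE)
```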
In the next section we introduce our dataset and look at its content.
Introducing The Dataset
This section of the report is quite essential for our analysis. We cannot make any interesting inferences from the dataset if we do not know what is contained within it. In this section we will try to understand exactly what we are dealing with. Thereafter, we can begin to draw interesting results.
The movie_metadata dataset contains 28 unique columns/variables, each of which is described in the table below:
| Variable Name | Description |
|---|---|
| color | Specifies whether a movie is in black and white or color |
| director_name | Contains name of the director of a movie |
| num_critic_for_reviews | Contains number of critic reviews per movie |
| duration | Contains duration of a movie in minutes |
| director_facebook_likes | Contains number of facebook likes for a director |
| actor_3_facebook_likes | Contains number of facebook likes for actor 3 |
| actor_2_name | Contains name of 2nd leading actor of a movie |
| actor_1_facebook_likes | Contains number of facebook likes for actor 1 |
| gross | Contains the amount a movie grossed in USD |
| genres | Contains the sub-genres to which a movie belongs |
| actor_1_name | Contains name of the actor in lead role |
| movie_title | Title of the Movie |
| num_voted_users | Contains number of users votes for a movie |
| cast_total_facebook_likes | Contains number of facebook likes for the entire cast of a movie |
| actor_3_name | Contains the name of the 3rd leading actor of a movie |
| facenumber_in_poster | Contains number of actors' faces on a movie poster |
| plot_keywords | Contains key plot words associated with a movie |
| movie_imdb_link | Contains the link to the imdb movie page |
| num_user_for_reviews | Contains the number of user generated reviews per movie |
| language | Contains the language of a movie |
| country | Contains the name of the country in which a movie was made |
| content_rating | Contains maturity rating of a movie |
| budget | Contains the amount of money spent in production per movie |
| title_year | Contains the year in which a film was released |
| actor_2_facebook_likes | Contains number of facebook likes for actor 2 |
| imdb_score | Contains user generated rating per movie |
| aspect_ratio | Contains the size of the aspect ratio of a movie |
| movie_facebook_likes | Number of likes of the movie on its Facebook Page |
Furthermore, the dataset contains 5043 movies, spanning 96 years and 46 countries. There are 1693 unique director names and 5390 unique actors/actresses. Around 79% of the movies are from the USA, 8% from the UK, and 13% from other countries.
The structure of the dataset can also be used to understand our data. We can run the following code chunk to see its structure.
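The chunk that produced the output below is not shown in this render; it is presumably just:

```r
# Display the structure: one line per variable with its type and first values
str(movie_metadata)
```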
## 'data.frame': 5043 obs. of 28 variables:
## $ color : chr "Color" "Color" "Color" "Color" ...
## $ director_name : chr "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
## $ num_critic_for_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
## $ duration : int 178 169 148 164 NA 132 156 100 141 153 ...
## $ director_facebook_likes : int 0 563 0 22000 131 475 0 15 0 282 ...
## $ actor_3_facebook_likes : int 855 1000 161 23000 NA 530 4000 284 19000 10000 ...
## $ actor_2_name : chr "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
## $ actor_1_facebook_likes : int 1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
## $ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
## $ genres : chr "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|Thriller" "Action|Thriller" ...
## $ actor_1_name : chr "CCH Pounder" "Johnny Depp" "Christoph Waltz" "Tom Hardy" ...
## $ movie_title : chr "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
## $ num_voted_users : int 886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
## $ cast_total_facebook_likes: int 4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
## $ actor_3_name : chr "Wes Studi" "Jack Davenport" "Stephanie Sigman" "Joseph Gordon-Levitt" ...
## $ facenumber_in_poster : int 0 0 1 0 0 1 0 1 4 3 ...
## $ plot_keywords : chr "avatar|future|marine|native|paraplegic" "goddess|marriage ceremony|marriage proposal|pirate|singapore" "bomb|espionage|sequel|spy|terrorist" "deception|imprisonment|lawlessness|police officer|terrorist plot" ...
## $ movie_imdb_link : chr "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1" ...
## $ num_user_for_reviews : int 3054 1238 994 2701 NA 738 1902 387 1117 973 ...
## $ language : chr "English" "English" "English" "English" ...
## $ country : chr "USA" "USA" "UK" "USA" ...
## $ content_rating : chr "PG-13" "PG-13" "PG-13" "PG-13" ...
## $ budget : num 237000000 300000000 245000000 250000000 NA ...
## $ title_year : int 2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
## $ actor_2_facebook_likes : int 936 5000 393 23000 12 632 11000 553 21000 11000 ...
## $ imdb_score : num 7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
## $ aspect_ratio : num 1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
## $ movie_facebook_likes : int 33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
In the next section we can start preparing the dataset for analysis by removing and simplifying some of the data.
Pre-Processing Data
In this part of the report we attempt to look for various things that may have a negative or significant impact on the inferences we make on the dataset. Once we have sufficiently cleaned and prepared the dataset, we can commence with drawing various conclusions from the graphs we generate.
Duplicate Rows
In the movie_metadata dataset, we find that there are 45 duplicated rows, which we remove so that only the unique rows are kept.
## [1] 45
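A minimal sketch of the duplicate check and removal in base R (the original chunk is not shown in this render):

```r
# Count fully duplicated rows
sum(duplicated(movie_metadata))
# Keep only the unique rows
movie_metadata <- movie_metadata[!duplicated(movie_metadata), ]
```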
Missing Values
Let’s have a look at the number of NA values in our dataset:
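The counting chunk is not shown; output of the shape below is typically produced by something like:

```r
# Number of NA values per column
colSums(is.na(movie_metadata))
```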
## color director_name
## 0 0
## num_critic_for_reviews duration
## 49 15
## director_facebook_likes actor_3_facebook_likes
## 103 23
## actor_2_name actor_1_facebook_likes
## 0 7
## gross genres
## 874 0
## actor_1_name movie_title
## 0 0
## num_voted_users cast_total_facebook_likes
## 0 0
## actor_3_name facenumber_in_poster
## 0 13
## plot_keywords movie_imdb_link
## 0 0
## num_user_for_reviews language
## 21 0
## country content_rating
## 0 0
## budget title_year
## 487 107
## actor_2_facebook_likes imdb_score
## 13 0
## aspect_ratio movie_facebook_likes
## 327 0
To help visualise this, have a look at the following heatmap of the missing values:
# Visualise Missing Values
missing.values <- aggr(movie_metadata, sortVars = T, prop = T, sortCombs = T, cex.lab = 1.5, cex.axis = .6, cex.numbers = 5, combined = F, gap = -.2)

## 
## Variables sorted by number of missings:
## Variable Count
## gross 0.174869948
## budget 0.097438976
## aspect_ratio 0.065426170
## title_year 0.021408563
## director_facebook_likes 0.020608243
## num_critic_for_reviews 0.009803922
## actor_3_facebook_likes 0.004601841
## num_user_for_reviews 0.004201681
## duration 0.003001200
## facenumber_in_poster 0.002601040
## actor_2_facebook_likes 0.002601040
## actor_1_facebook_likes 0.001400560
## color 0.000000000
## director_name 0.000000000
## actor_2_name 0.000000000
## genres 0.000000000
## actor_1_name 0.000000000
## movie_title 0.000000000
## num_voted_users 0.000000000
## cast_total_facebook_likes 0.000000000
## actor_3_name 0.000000000
## plot_keywords 0.000000000
## movie_imdb_link 0.000000000
## language 0.000000000
## country 0.000000000
## content_rating 0.000000000
## imdb_score 0.000000000
## movie_facebook_likes 0.000000000
Gross and Budget
Gross and budget have many missing values (874 and 487 respectively). Since we want to keep these two variables for the analysis that follows, and imputation will not do a good job here, we simply delete the rows with missing gross or budget values.
# Find NA values for gross and budget
movie_metadata <- movie_metadata[!is.na(movie_metadata$gross), ]
movie_metadata <- movie_metadata[!is.na(movie_metadata$budget), ]
dim(movie_metadata)

## [1] 3857 28
The number of observations has decreased by 4998 - 3857 = 1141, which is fortunately only 22.8% of the previous total.
Content Rating
The dataset contains a vast range of content ratings, which can be seen below:
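The chunk that produced the table below is not shown; it is presumably:

```r
# Frequency of each content rating
table(movie_metadata$content_rating)
```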
## 
##           Approved        G       GP        M    NC-17 Not Rated
##       51       17       91        1        2        6        42
##   Passed       PG    PG-13        R  Unrated        X
##        3      573     1314     1723       24        10
We find that M and GP are older labels for what is now PG, and X is the older label for NC-17, so let’s replace M and GP with PG, and X with NC-17, since these are the ratings in use today.
# Renaming content ratings
movie_metadata$content_rating[movie_metadata$content_rating == 'M'] <- 'PG'
movie_metadata$content_rating[movie_metadata$content_rating == 'GP'] <- 'PG'
movie_metadata$content_rating[movie_metadata$content_rating == 'X'] <- 'NC-17' # No one under 17

We want to replace Approved, Not Rated, Passed, and Unrated with the most common rating, R.
movie_metadata$content_rating[movie_metadata$content_rating == 'Approved'] <- 'R'
movie_metadata$content_rating[movie_metadata$content_rating == 'Not Rated'] <- 'R'
movie_metadata$content_rating[movie_metadata$content_rating == 'Passed'] <- 'R'
movie_metadata$content_rating[movie_metadata$content_rating == 'Unrated'] <- 'R'
movie_metadata$content_rating <- factor(movie_metadata$content_rating)
table(movie_metadata$content_rating)

## 
##              G  NC-17     PG  PG-13      R
##     51     91     16    576   1314   1809
Blank content ratings (the 51 unlabelled entries in the table above) should be treated as missing values. Since these missing values cannot be replaced with reasonable data, we delete those rows.
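A sketch of how the blanks can be recoded and dropped in base R (the original chunk is not shown):

```r
# Treat empty strings as missing, then drop those rows
movie_metadata$content_rating[movie_metadata$content_rating == ""] <- NA
movie_metadata <- movie_metadata[!is.na(movie_metadata$content_rating), ]
# Drop the now-empty factor level
movie_metadata$content_rating <- droplevels(movie_metadata$content_rating)
```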
Delete (Some) Rows
Let’s now have a look at how many complete cases we have.
## color director_name
## 0 0
## num_critic_for_reviews duration
## 1 0
## director_facebook_likes actor_3_facebook_likes
## 0 6
## actor_2_name actor_1_facebook_likes
## 0 1
## gross genres
## 0 0
## actor_1_name movie_title
## 0 0
## num_voted_users cast_total_facebook_likes
## 0 0
## actor_3_name facenumber_in_poster
## 0 6
## plot_keywords movie_imdb_link
## 0 0
## num_user_for_reviews language
## 0 0
## country content_rating
## 0 0
## budget title_year
## 0 0
## actor_2_facebook_likes imdb_score
## 2 0
## aspect_ratio movie_facebook_likes
## 55 0
We remove aspect_ratio because (1) it has a lot of missing values and (2) we will not be looking into its impact on the other variables (we assume it has none).
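The removal chunk is not shown; a sketch, using the same `subset` idiom the report uses later for the genres column:

```r
# Drop the aspect_ratio column
movie_metadata <- subset(movie_metadata, select = -c(aspect_ratio))
```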
Add a Column
Gross and Budget
We have gross and budget information. So let’s add two columns: profit and percentage return on investment for further analysis.
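The chunk is not shown; a plausible version, using the column names (`profit`, `return_on_investment_perc`) that appear in later chunks of this report:

```r
# Profit in USD and percentage return on investment
movie_metadata$profit <- movie_metadata$gross - movie_metadata$budget
movie_metadata$return_on_investment_perc <-
  (movie_metadata$profit / movie_metadata$budget) * 100
```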
Remove (Some) Columns
Colour
Next, we take a look at the influence of colour vs black and white.
## 
##       Black and White  Color
##     2             124   3680
Since only around 3% of the movies are in black and white, we can remove the color column.
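A sketch of the removal (the original chunk is not shown):

```r
# Drop the color column
movie_metadata <- subset(movie_metadata, select = -c(color))
```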
Language
Let’s have a look at the different languages contained within the dataset.
## 
##              Aboriginal     Arabic    Aramaic    Bosnian  Cantonese
##           2          2          1          1          1          7
##       Czech     Danish       Dari      Dutch    English   Filipino
##           1          3          2          3       3644          1
##      French     German     Hebrew      Hindi  Hungarian Indonesian
##          34         11          2          5          1          2
##     Italian   Japanese     Kazakh     Korean   Mandarin       Maya
##           7         10          1          5         14          1
##   Mongolian       None  Norwegian    Persian Portuguese   Romanian
##           1          1          4          3          5          1
##     Russian    Spanish       Thai Vietnamese       Zulu
##           1         24          3          1          1
Almost 95% of the movies are in English, which means this variable is nearly constant. Let’s remove it.
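The removal chunk is not shown; one idiomatic way to drop a single column in base R:

```r
# Drop the language column
movie_metadata$language <- NULL
```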
Country
Next, we can look at the different types of countries.
##
## Afghanistan Argentina Aruba Australia Belgium
## 1 3 1 40 1
## Brazil Canada Chile China Colombia
## 5 63 1 13 1
## Czech Republic Denmark Finland France Georgia
## 3 9 1 103 1
## Germany Greece Hong Kong Hungary Iceland
## 79 1 13 2 1
## India Indonesia Iran Ireland Israel
## 5 1 4 7 2
## Italy Japan Mexico Netherlands New Line
## 11 15 10 3 1
## New Zealand Norway Official site Peru Philippines
## 11 4 1 1 1
## Poland Romania Russia South Africa South Korea
## 1 2 3 3 8
## Spain Taiwan Thailand UK USA
## 22 2 4 316 3025
## West Germany
## 1
Around 79% of the movies are from the USA, 8% from the UK, and 13% from other countries, so we group the remaining countries together to give this categorical variable fewer levels: USA, UK, and Others.
# Grouping countries
levels(movie_metadata$country) <- c(levels(movie_metadata$country), "Others")
movie_metadata$country[(movie_metadata$country != 'USA')&(movie_metadata$country != 'UK')] <- 'Others'
movie_metadata$country <- factor(movie_metadata$country)
table(movie_metadata$country)

## 
## Others     UK    USA
##    465    316   3025
Now that we’ve cleaned up our dataset, we can continue to explore the data even further! In the next section we look at genres, movie popularity, gross, profit, and many other aspects pertinent to our data.
Analysing Data
When inspecting a dataset of movies over the past few years, various interesting inferences can be uncovered. A movie may have a high rating yet low return on investment. Which genre is the most successful? Which actors are the most popular? These are some of the questions we aim to answer in this section.
We can start by performing basic analysis on our data. Thereafter, we delve deeper into more specific parts of the dataset, in the hope of uncovering interesting observations.
Basic Analysis
Let’s first have a look at the number of movies that are produced over the years.
# Plotting the number of movies released
ggplot(movie_metadata, aes(title_year)) +
geom_bar() +
labs(x = "Year movie was released", y = "Movie Count", title = "Number of Movies Released Per Year (1916 - 2016)") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(xintercept=c(1980), linetype="dotted") +
ggplot2::annotate("text", label = "Year = 1980", x = 1979, y = 50, size = 3, colour = "blue", angle = 90)

From the graph we see that there are not many records of movies released before 1980. It is better to remove those records because they might not be representative of the data.
Let’s have a look at the movie counts now:
# Plotting the number of movies released since 1980
ggplot(movie_metadata, aes(title_year)) +
geom_bar() +
labs(x = "Year movie was released", y = "Movie Count", title = "Number of Movies Released Per Year (1980 - 2016)") +
theme(plot.title = element_text(hjust = 0.5))

The graph above illustrates the number of movies released in the period 1980 - 2016. As we can see, the number of movies released per year rose rapidly from the 1980s onward.
Movie Genre Analysis
Now we can delve into more specific things regarding movies, like genres.
Top Genres
genre = movie_metadata['genres']
# Make genre a dataframe
genre = data.frame(table(genre))
# Order genres based on frequency
genre = genre[order(genre$Freq,decreasing=TRUE),]
# Top 20 genres with the most movies
ggplot(genre[1:20,], aes(x=reorder(factor(genre), Freq), y=Freq, alpha=Freq)) +
geom_bar(stat = "identity", fill="maroon") +
geom_text(aes(label=Freq),hjust=1.2, size=3.5)+
xlab("Genre") +
ylab("Number of Movies") +
ggtitle("Top 20 Genres with the most movies") +
coord_flip()

From the above, a combination of Comedy, Romance, and Drama appears to be by far the most frequently produced genre string. As you can see, movies are associated with multiple genres. For analysis purposes, we choose the first word in the genres column, as this is likely to be the most accurate description of the movie.
Split Genres
Here we first split the genres into multiple columns and merge them together.
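The chunk producing the output below is not shown; it is likely just a peek at the raw genre strings, e.g.:

```r
# First few genre strings, with sub-genres separated by "|"
head(movie_metadata$genres)
```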
## [1] "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy"
## [3] "Action|Adventure|Thriller" "Action|Thriller"
## [5] "Action|Adventure|Sci-Fi" "Action|Adventure|Romance"
Let’s split the genres separated by “|” into 8 different columns.
# Split on "|" (no n limit, so every sub-genre is kept)
genres_split <- str_split(movie_metadata$genres, pattern = "[|]")
# Create a matrix: rbind recycles rows with fewer genres,
# repeating their entries to fill all 8 columns
genres_matrix <- do.call(rbind, strsplit(movie_metadata$genres, '[|]'))
# Dataframe of genres
genres_df <- as.data.frame(genres_matrix)

genres_df consists of 8 columns, each with different genres. Let’s have a look at the frequency of ALL the genres.
# Collapse all genres into one column
genres_one_col <- gather(genres_df) %>%
select(value)
# Top 30
top30 <- genres_one_col %>%
group_by(value) %>%
tally() %>%
filter(n >= 30)
# Plot frequency of first column
top30 %>%
ggplot(aes(x=reorder(factor(value), n), y=n, alpha=n)) +
geom_bar(stat="identity", fill="maroon") +
geom_text(aes(label=n),hjust=1.2, size=3.5) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Genre") +
ylab("Frequency") +
ggtitle("Movie Genre Frequency") +
coord_flip()

It is evident that the Drama and Comedy genres are still the most frequently produced. It is also interesting to note that Romance sits considerably lower on this list than in the previous graph. This suggests that Romance predominantly co-occurs with Comedy or Drama, and does not co-occur with other genres as frequently as Comedy and Drama do. Additionally, the fact that Comedy and Drama occur the most does not necessarily mean that they are the most profitable or return the best ROI. We will explore this later in the report.
Previously we assumed that the first genre listed is the most applicable; therefore, we choose the first column as the genre of each movie and append it to the dataframe.
# Remove old genre column
movie_metadata <- subset(movie_metadata, select = -c(genres))
# Take first column of genres_df and add it to MAIN df
movie_metadata$genre <- genres_df$V1How does this distribution look like over the years? Lets have a look at the frequency of genres between the period of 1980 and 2016.
# Plotting the movie genres produced
movie_metadata %>%
group_by(title_year, genre) %>%
summarise(count = n()) %>%
ggplot(aes(title_year, as.factor(genre))) +
geom_tile(aes(fill=count),colour="white") +
scale_fill_gradient(low="light blue",high = "dark blue") +
xlab("Year of Movie") +
ylab("Genre of Movie") +
ggtitle("Heat Map of Movie Genres Produced Over the Years") +
theme(panel.background = element_blank())

We can make one or two remarks from this heatmap. Firstly, we can see that Action, Adventure, Comedy, and Drama are the genres predominantly used as the first term associated with a movie. Secondly, Romance, which previously had a high frequency of co-occurrence with Comedy and Drama, is now very low. This means there are very few movies that are predominantly Romance; Romance is mostly the second or third term used to describe a movie.
Additionally, we can also see that some genres, like Action and Comedy have picked up over the years. This is evident by looking at the darker shades of blue becoming more prominent in the latter years.
Which Genres are Popular?
In our dataset we have Facebook Likes and IMDB scores associated with a movie. This can help give us an indication of how popular each genre is. There are different types of Facebook likes within the dataset, namely director_facebook_likes, actor_3_facebook_likes, actor_1_facebook_likes, cast_total_facebook_likes, actor_2_facebook_likes, and finally movie_facebook_likes. Let’s add them all together for each movie.
# Add column for total facebook likes
movie_metadata$total_facebook_likes <- movie_metadata$director_facebook_likes +
movie_metadata$actor_3_facebook_likes + movie_metadata$actor_1_facebook_likes +
movie_metadata$cast_total_facebook_likes + movie_metadata$actor_2_facebook_likes +
movie_metadata$movie_facebook_likes

Now let’s calculate the average IMDB score, average user votes, average Facebook likes, and average number of reviews for each genre.
# creating a data frame containing avg score, avg votes and avg fb likes
score_votes_likes <- movie_metadata %>%
group_by(genre) %>%
summarise(count = n(),
avg_score = round(mean(imdb_score), 1),
avg_votes = mean(num_voted_users),
avg_facebook_likes = mean(total_facebook_likes),
avg_reviews = mean(num_user_for_reviews)) %>%
filter(count > 10)
# arranging data frame by average score
arr_score <- arrange(score_votes_likes, desc(avg_facebook_likes))
# Show data frame
as.data.frame(score_votes_likes)

##          genre count avg_score avg_votes avg_facebook_likes avg_reviews
## 1 Action 937 6.3 145049.12 38762.692 462.1942
## 2 Adventure 356 6.5 133549.07 40494.340 355.0365
## 3 Animation 43 6.7 83318.51 30471.674 202.0698
## 4 Biography 200 7.1 100233.72 41000.900 260.4650
## 5 Comedy 997 6.1 64763.60 23896.765 197.9388
## 6 Crime 247 6.9 130843.95 33857.397 362.5789
## 7 Documentary 32 6.8 16159.19 5623.312 126.3750
## 8 Drama 667 6.8 85462.68 31889.940 319.7301
## 9 Fantasy 37 6.3 87311.14 25452.270 374.7027
## 10 Horror 152 5.7 66091.12 21356.158 406.8750
## 11 Mystery 24 6.6 180824.04 35571.625 570.2500
The most liked genres on Facebook are Biography (41000.9 average likes), Adventure (40494.3), and Action (38762.7). The most popular genres, based on IMDB scores, are Biography (7.1), Crime (6.9), Documentary (6.8), and Drama (6.8). It is interesting to note that although we previously saw Comedy and Drama being produced most frequently, the public consistently scores Biography, Crime, Documentary, and Drama movies highly. This may imply a wide range of scores within Comedy and Drama, justifying their somewhat lower averages. Furthermore, Comedy, Action, and Drama have very high numbers of voters, which may also account for their lower overall averages.
Popularity Analysis
IMDB ratings VS Movie Count
Let’s have a look at the distribution of IMDB ratings over the number of movies produced. Below we can see that the data is slightly skewed to the left: there is a concentration of scores around 6 out of 10 and a long tail to the left. The vast majority of movies score between 5 and 7.5, with fewer movies scoring higher than that.
# Plotting the IMDB ratings vs movie count
ggplot(movie_metadata, aes(imdb_score)) +
geom_histogram(bins = 50) +
geom_vline(xintercept = mean(movie_metadata$imdb_score,na.rm = TRUE),colour = "blue") +
ylab("Movie Count") +
xlab("IMDB Rating") +
scale_x_continuous(limits = c(0, 10)) +
ggtitle("IMDB Ratings for Movies") +
ggplot2::annotate("text", label = "Mean IMDB rating", x = 6.2, y = 50, size = 3, colour = "yellow", angle = 90)

The lowest-scored movie, Justin Bieber: Never Say Never, has a score of 1.6, whereas the highest score, 9.3, belongs to The Shawshank Redemption. The mean of the IMDB scores is 6.43.
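The figures quoted for the lowest, highest, and mean scores can be checked directly (a sketch, not the original chunk):

```r
# Lowest- and highest-scored movies, and the mean score
movie_metadata$movie_title[which.min(movie_metadata$imdb_score)]
movie_metadata$movie_title[which.max(movie_metadata$imdb_score)]
mean(movie_metadata$imdb_score, na.rm = TRUE)
```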
Popularity over the years
Let’s take a look at what happened when Facebook started. From the graph it is clear that the average number of Facebook likes for movies released after 2004 has increased dramatically. Older movies receive fewer likes, most likely because Facebook marketing concentrates on newer movies.
#Creating the required subset of data
movies_pop <- movie_metadata %>%
select(title_year, movie_facebook_likes) %>%
filter(title_year > 1980) %>%
group_by(title_year) %>%
summarise(avg = mean(movie_facebook_likes))
#Generating the popularity vs time plot
ggplot(movies_pop, aes(x = title_year, y = avg)) +
geom_point() +
geom_smooth() +
geom_vline(xintercept = c(2004),colour = c("blue")) +
ylab("Average Facebook Likes") +
xlab("Years Movie Was Produced") +
ggplot2::annotate("text", label = "Facebook", x = 2003.5, y = 10000, size = 3, colour = "blue", angle = 90)

Facebook Likes VS IMDB Score
# Plotting the Facebook Likes VS IMDB Score
movie_metadata %>%
group_by(content_rating) %>%
summarise(avg_fb_likes = mean(movie_facebook_likes), avg_imdb_score = mean(imdb_score), num=n()) %>%
ggplot() +
geom_point(aes(x=avg_fb_likes, y=avg_imdb_score, color=content_rating, size=num), stat="identity") +
scale_y_continuous(limits = c(0, 10)) +
xlab("Average Facebook Likes") +
ylab("Average imdb score")There are three things to look at with this graph: Average IMDB score, average Facebook likes and the number of movies rated per content-rating. We can see that all movies have (on average) very similar IMDB scores, however, they differ higly on the number of Facebook likes. For example, movies rated PG-13 receive much more Facebook likes (on average) than movies rated NC-17. Movies rated R receive (on average) more likes than movies rated NC-17, but still have relatively similar IMDB scores.
IMDB Scores VS Facebook Likes
We can infer a strong correlation between a movie’s Facebook likes and its IMDB score. This is expected, as a higher rating relates to higher viewer satisfaction, and hence we expect an increase in positive online presence. Initially, this graph was constructed to see whether there would be a difference between viewer enjoyment and movie rating. Movie databases are often criticized for the nature of their rating scales, made by critics who prioritise sentiment and plot, which may not fully coincide with viewer enjoyment. However, as seen below, this is not the case with IMDB’s scoring.
#Plotting Facebook likes against IMDB score
ggplot(data = movie_metadata, aes(x = imdb_score, y = movie_facebook_likes)) +
scale_x_continuous(limits = c(0, 10)) +
geom_smooth()

Top 20 directors with highest average IMDB score
Let’s take a look at the directors. We can see that the top IMDB rated directors have very similar scores (8.1 - 8.6). Tony Kaye has the highest rating of 8.6.
# Displaying the avg_imdb score of directors
movie_metadata %>%
group_by(director_name) %>%
summarise(avg_imdb = mean(imdb_score)) %>%
arrange(desc(avg_imdb)) %>%
top_n(20, avg_imdb) %>%
formattable(list(avg_imdb = color_bar("orange")), align = 'l')

| director_name | avg_imdb |
|---|---|
| Tony Kaye | 8.600000 |
| Damien Chazelle | 8.500000 |
| Majid Majidi | 8.500000 |
| Ron Fricke | 8.500000 |
| Christopher Nolan | 8.425000 |
| Asghar Farhadi | 8.400000 |
| Marius A. Markevicius | 8.400000 |
| Richard Marquand | 8.400000 |
| Sergio Leone | 8.400000 |
| Lee Unkrich | 8.300000 |
| Lenny Abrahamson | 8.300000 |
| Pete Docter | 8.233333 |
| Hayao Miyazaki | 8.225000 |
| Joshua Oppenheimer | 8.200000 |
| Juan José Campanella | 8.200000 |
| Quentin Tarantino | 8.200000 |
| David Sington | 8.100000 |
| Je-kyu Kang | 8.100000 |
| Terry George | 8.100000 |
| Tim Miller | 8.100000 |
Vote Counts
The IMDB rating system was first implemented in the 1990s, while social media platforms like Facebook started in the mid 2000s. From the graph, the launch of Facebook coincides with a far larger rise in the number of votes per movie than the launch of IMDB.
#Performing operations on Movies Vote Count over the years
movies_vote1 <- movie_metadata %>%
select(title_year, num_voted_users) %>%
group_by(title_year) %>%
summarise(count = sum(num_voted_users))
ggplot(movies_vote1, aes(x = title_year, y = count/1000)) +
geom_bar( stat = "identity") +
geom_vline(xintercept = c(1990,2004),colour = c("orange","blue")) +
ylab("Vote count (in thousands)") +
xlab("Years") +
ggplot2::annotate("text", label = "Facebook",x = 2003, y = 21000, size = 3, colour = "blue",angle=90) +
ggplot2::annotate("text", label = "IMDB", x = 1989, y = 21000, size = 3, colour = "orange", angle = 90)

Vote Counts VS IMDB score
# breaking num_users_voted into 4 buckets
movie_metadata$vote_bucket <- cut(movie_metadata$num_voted_users,
c(0, 50000, 100000, 300000, 500000))

# plotting a boxplot for votes vs imdb_score
bp <- na.omit(movie_metadata) %>%
ggplot(aes(x = vote_bucket, y = imdb_score)) +
geom_boxplot(aes(fill = vote_bucket), alpha = 0.7,
show.legend = F) +
stat_summary(fun.y = mean, geom = "point",
shape = 23) +
coord_flip() +
xlab("User Votes") +
ylab("IMDB Score") +
ggtitle("Box plot for User Votes vs IMDB Score") +
theme_minimal() +
theme(plot.title = element_text(size = 14,
color = "darkgrey",
family = "Times"))
ggplotly(bp) %>%
layout(margin = m,
xaxis = a,
yaxis = a,
legend = list(orientation = "h", size = 4,
bgcolor = "#E2E2E2",
bordercolor = "darkgrey",
borderwidth = 1,
x = 0,
y = -0.3))

# scatter plot for user votes vs imdb score
scatter_plot(movie_metadata$num_voted_users, movie_metadata$imdb_score,
"User Votes",
"IMDB Score",
"Scatter plot for User Votes vs IMDB Score",
"User Votes:",
"<br>IMDB Score:",
"<br>Title:",
alpha = 0.3)

Average score, average votes, and average user reviews
Each line represents a yearly average: IMDB score, number of votes, and number of user reviews.
# creating a data frame for average score, avg votes and, avg user reviews
# by title year
scat_year_score <- movie_metadata %>%
group_by(title_year) %>%
summarise(count = n(),
IMDB_Score = round(mean(imdb_score),1),
avg_votes = mean(num_voted_users),
avg_user_review = round(mean(num_user_for_reviews)))
# plotting line graph for Avg score by title year
lp1 <- line_graph(scat_year_score,
scat_year_score$IMDB_Score,
"Average Score")
# plotting line graph for Avg votes by title year
lp2 <- line_graph(scat_year_score,
scat_year_score$avg_votes,
"Average Votes")
# plotting line graph for Avg reviews by title year
lp3 <- line_graph(scat_year_score,
scat_year_score$avg_user_review,
"Average User Review")
subplot(lp1, lp2, lp3, nrows = 3, heights = c(0.33, 0.33, 0.33))

Profit | Gross | Return on Investment
# Plotting movie budget and gross
budget <- movie_metadata %>%
select(movie_title, budget) %>%
arrange(desc(budget)) %>%
head(15)
x <- ggplot(budget, aes(x = reorder(movie_title, -desc(budget)), y = budget/1000000)) +
geom_bar( stat = "identity")+
theme(axis.text.x=element_text(hjust=1))+
ggtitle("Movie Budgets")+
xlab("")+
ylab("Budget (in Millions)") +
coord_flip()
rev <- movie_metadata %>%
select(movie_title, gross) %>%
arrange(desc(gross)) %>%
head(15)
y <- ggplot(rev, aes(x = (reorder(movie_title, -desc(gross))), y = gross/1000000)) +
geom_bar( stat = "identity")+
theme(axis.text.x=element_text(hjust=1))+
ggtitle("Movie Gross")+
xlab("")+
ylab("Gross (in Millions)") +
coord_flip()
ggarrange(x, y,
labels = c("A", "B"),
ncol = 1, nrow = 2)

From the graph above, it is evident that a higher budget does not necessarily translate into a higher gross.
Most Successful Directors based on Profit
#Top 20 most successful directors
movie_metadata %>%
group_by(director_name) %>%
mutate(profit = gross - budget)%>%
select(director_name, budget, gross, profit) %>%
na.omit() %>%
summarise(films = n(), budget = sum(as.numeric(budget)), gross = sum(as.numeric(gross)), profit = sum(as.numeric(profit))) %>%
mutate(avg_per_film = profit/films) %>%
arrange(desc(avg_per_film)) %>%
top_n(20, avg_per_film) %>%
ggplot( aes(x = films, y = avg_per_film/1000000)) +
geom_point(size = 1, color = "blue") +
geom_text_repel(aes(label = director_name), size = 3, color = "blue") +
xlab("Number of Films") + ylab("Avg Profit $millions") +
ggtitle("Most Successful Directors")
Looking at the most successful directors, we see two routes to the top: producing a single highly profitable film (Tim Miller), or creating an array of successful films with large budgets, such as James Cameron.
Top 20 movies based on its Profit
#Top 20 movies based on its Profit
movie_metadata %>%
filter(title_year %in% c(2000:2016)) %>%
mutate(profit = gross - budget,
return_on_investment_perc = (profit/budget)*100) %>%
arrange(desc(profit)) %>%
top_n(20, profit) %>%
ggplot(aes(x=budget/1000000, y=profit/1000000)) +
geom_point(size = 2) +
geom_smooth(size = 1) +
geom_text_repel(aes(label = movie_title), size = 3) +
xlab("Budget $million") +
ylab("Profit $million") +
ggtitle("20 Most Profitable Movies")## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
When assessing the top 20 movies by profit, Avatar has the highest profit, which is consistent with its director, James Cameron, appearing at the top of the directors chart above.
#Top genres based on its Profit
movie_metadata %>%
filter(title_year %in% c(2000:2016)) %>%
mutate(profit = gross - budget,
return_on_investment_perc = (profit/budget)*100) %>%
group_by(genre)%>%
na.omit() %>%
summarise(av_profit = mean(profit), av_budget = mean(budget)) %>%
arrange(desc(av_profit)) %>%
ggplot(aes(x=av_budget/10000, y=av_profit/10000)) +
geom_jitter(size = 2) +
geom_smooth(size = 1) +
geom_text_repel(aes(label = genre), size = 3) +
xlab("Budget $million") +
ylab("Profit $million") +
ggtitle("Average Return for each Genre")## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
As the graph above shows, even though comedy and drama are the most commonly produced genres, they are not, on average, the most profitable. Western and thriller movies have far smaller budgets than the other genres.
Top 20 movies based on its Return on Investment
#Top 20 movies based on its Return on Investment
movie_metadata %>%
filter(budget >100000) %>%
mutate(profit = gross - budget,
return_on_investment_perc = (profit/budget)*100) %>%
arrange(desc(profit)) %>%
top_n(20, profit) %>%
ggplot(aes(x=budget/1000000, y=return_on_investment_perc)) +
geom_point(size = 2) +
geom_smooth(size = 1) +
geom_text_repel(aes(label = movie_title), size = 3) +
xlab("Budget $million") +
ylab("Percent Return on Investment") +
ggtitle("20 Most Profitable Movies based on its Return on Investment")## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Successful directors such as George Lucas also have highly profitable movies.
Further Analysis
Commercial success Vs Critical acclaim
# Plotting of the commercial success vs critical accliam
movie_metadata %>%
top_n(15, profit) %>%
ggplot(aes(x = imdb_score, y = gross/10^6, size = profit/10^6, color = content_rating)) +
geom_point() +
geom_hline(aes(yintercept = 600)) +
geom_vline(aes(xintercept = 7.75)) +
geom_text_repel(aes(label = movie_title), size = 4) +
xlab("Imdb score") +
ylab("Gross money earned in million dollars") +
ggtitle("Commercial success Vs Critical acclaim") +
ggplot2::annotate("text", x = 8.5, y = 700, label = "High ratings \n & High gross") +
theme(plot.title = element_text(hjust = 0.5))
The graph above compares commercial success (gross) with critical acclaim (IMDb score), coloured by content rating; the highest-grossing films again belong to the successful directors identified earlier.
Gross vs Content Rating
# Plotting gross against content rating
movie_metadata %>%
filter(imdb_score > 6) %>% # only movies with an IMDb score greater than 6
na.omit() %>%
ggplot(aes(x = gross/1000000, y = content_rating), height=0) +
geom_jitter(alpha = 0.5, col = "darkgreen") +
theme(axis.text.x=element_text(angle = 90, hjust = 1))+
ggtitle("Gross vs Content Rating") +
xlab("Gross (millions)") +
ylab("Rating") +
geom_smooth() +
coord_flip()
Correlation Heatmap
# Plotting the heatmap
ggcorr(movie_metadata, label = TRUE, label_round = 2, label_size = 2.8, size = 2, hjust = .85) +
ggtitle("Correlation Heatmap") +
theme(plot.title = element_text(hjust = 0.5))## Warning in ggcorr(movie_metadata, label = TRUE, label_round = 2, label_size
## = 2.8, : data in column(s) 'director_name', 'actor_2_name', 'actor_1_name',
## 'movie_title', 'actor_3_name', 'plot_keywords', 'movie_imdb_link',
## 'language', 'country', 'content_rating', 'genre', 'vote_bucket' are not
## numeric and were ignored
Based on the heatmap, we can see some high correlations (greater than 0.7) between predictors.
The highest correlation (0.95) is between actor_1_facebook_likes and cast_total_facebook_likes, and the actor 2 and actor 3 likes are also correlated with the total. We therefore combine them into two variables: actor_1_facebook_likes and other_actors_facebook_likes.
There are high correlations among num_voted_users, num_user_for_reviews and num_critic_for_reviews.
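The kind of redundancy the heatmap exposes can be reproduced on simulated data with base R's cor() (variable names hypothetical; the "total" is constructed to be dominated by the lead actor):

```r
set.seed(1)
actor_1 <- rnorm(100)            # simulated lead-actor likes
other   <- rnorm(100)            # simulated supporting-actor likes
total   <- actor_1 + 0.2 * other # total dominated by the lead actor
round(cor(actor_1, total), 2)    # close to 1: the two predictors are redundant
round(cor(other, total), 2)      # much weaker correlation
```

When two predictors correlate this strongly, keeping both adds little information, which motivates the variable merge described above.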
Sentiment Analysis
Exploration into Movie Plot Keywords
Keywords from movie plot lines are given in a single column, delimited by a “|”. The following code separates these words into separate columns.
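The essential operation is a regular-expression split on the literal pipe character; a minimal base-R sketch:

```r
# "|" must be escaped or placed in a character class, because strsplit()
# treats its split argument as a regular expression
strsplit("death|paris france|rock band", "[|]")[[1]]
# → "death" "paris france" "rock band"
```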
#Constructing Top 20
keywords_split <- str_split(IMDB$plot_keywords, pattern="[|]", n=5)
keywords_matrix <- do.call(rbind, strsplit(as.character(IMDB$plot_keywords), "[|]"))
keywords_df <- as.data.frame(keywords_matrix)
names(keywords_df) <- c("one", "two", "three", "four", "five")
keywords_one_col <- gather(keywords_df) %>%
select(value)
keywords_one_col_freq <- keywords_one_col %>%
group_by(value) %>%
tally()
top_20 <- keywords_one_col_freq %>%
select(value, n) %>%
top_n(20)## Selecting by n
keywords_one_col %>%
group_by(value) %>%
tally() %>%
filter(n > 30) %>%
ggplot() +
geom_bar(aes(x = value, y=n), stat="identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Keyword") +
ylab("Frequency") +
ggtitle("Frequency of common keywords")
Above we can see the most popular keywords describing the films in the dataset.
The following code checks to see which movies contain common keywords.
#Placing Top 20 Words Against Movie Success
IMDB_true_false <- IMDB
for (keyword in top_20$value) {
IMDB_true_false <- cbind(IMDB_true_false, ifelse(str_detect(IMDB$plot_keywords, keyword), "TRUE", "FALSE"))
}
for (i in 1:20) {
reference <- 27
names(IMDB_true_false)[reference + i] <- top_20$value[i]
}
does_contain_common_key <- data.frame()
for (keyword in top_20$value) {
does_contain_common_key <- rbind(does_contain_common_key, IMDB_true_false %>%
filter(get(keyword) == TRUE) %>% select(movie_title, gross, imdb_score, movie_facebook_likes, plot_keywords)) %>%
distinct(movie_title, .keep_all = T)
}
does_not_contain_common_key <- data.frame()
for (keyword in top_20$value) {
does_not_contain_common_key <- rbind(does_not_contain_common_key, IMDB_true_false %>%
filter(get(keyword) == FALSE) %>% select(movie_title, gross, imdb_score, movie_facebook_likes, plot_keywords)) %>%
distinct(movie_title, .keep_all = T)
}## movie_title plot_keywords
## 1 The Doors death|paris france|rock band|singer|the doors
## 2 For a Good Time, Call... friend|friendship|gay|phone sex|sex
Here are two movies that contain at least one common plot keyword.
## movie_title
## 1 The Mothman Prophecies
## 2 The Girl Next Door
## plot_keywords
## 1 car accident|death of wife|mothman|point pleasant west virginia|urban legend
## 2 forced to strip|male rear nudity|porn actress|scantily clad female|teenager
Here are two movies that do not contain any of the common plot key words.
does_contain_common_key <- does_contain_common_key %>%
mutate(tri = "Top 20 Word")
does_not_contain_common_key <- does_not_contain_common_key %>%
mutate(tri = "NOT Top 20 Word")
all_movies_keywords_indicated <- full_join(does_not_contain_common_key, does_contain_common_key, by = c("movie_title", "gross", "imdb_score", "movie_facebook_likes", "plot_keywords"))
all_movies_keywords_indicated <- all_movies_keywords_indicated %>%
mutate(type = coalesce(tri.y, tri.x)) %>%
select(movie_title, gross, imdb_score, movie_facebook_likes, plot_keywords, type)
all_movies_keywords_indicated <- all_movies_keywords_indicated %>%
group_by(type) %>%
na.omit() %>%
mutate(avg_gross = mean(gross))
summarise(all_movies_keywords_indicated, avg_gross = mean(gross)) %>%
ggplot() +
geom_bar(aes(x = type, y=avg_gross, fill=type), stat="identity", position="stack") +
theme(axis.text.x = element_blank()) +
xlab("Type of movie") +
ylab("Average Gross")
This graph compares the average gross of movies that contain at least one of the top 20 most common plot keywords with that of movies containing none of them.
The average gross of movies that do not contain one of the top 20 most common keywords appears to be higher.
Sentiment pre-processing
The split keywords are needed in order to perform sentiment analysis.
Calculating Sentiment
Below is where the sentiment of keywords is calculated. The sentiment function used comes from the syuzhet package, and it can detect the presence of eight different emotions, namely “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise”, and “trust”. It is also able to calculate positive and negative valence.
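Conceptually, get_nrc_sentiment() counts lexicon hits per emotion for each text. A minimal base-R sketch of the idea, using a tiny hand-made lexicon rather than the real NRC dictionary:

```r
# Hypothetical two-emotion lexicon; the real NRC lexicon covers eight
# emotions plus positive/negative valence
lexicon <- list(joy = c("love", "friend"), fear = c("death", "murder"))
keywords <- c("death of wife", "friend friendship", "love")
tokens <- unlist(strsplit(keywords, " "))          # split keywords into words
counts <- sapply(lexicon, function(w) sum(tokens %in% w))
counts  # joy: "friend" and "love" → 2 ; fear: "death" → 1
```

The yearly_sentiment() function below then divides such counts by the number of movies in each year, so years with more movies are not automatically scored as more emotional.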
#Function for sentiment per year
yearly_sentiment <- function(year, df) {
amount <- nrow(df %>%
select(Year) %>%
filter(Year == year))
df <- df %>%
filter(Year == year)
sentiments <- get_nrc_sentiment(as.character(df[4]))
for (i in 1:length(sentiments)) {
sentiments[i] <- sentiments[i]/amount
}
year_sentiment <- cbind(year, sentiments)
return (year_sentiment)
}
sentiments <- data.frame()
#For-loop to capture all years
for (i in all_years$Year) {
sentiments <- rbind(sentiments,yearly_sentiment(i, keywords_from_split))
}
Sentiment results
As previously mentioned, the number of movies increases dramatically from 1980 onwards, so the calculated sentiments are filtered to reduce the sensitivity of the data.
#Sort by year
sentiments_emotions <- sentiments[with(sentiments, order(year)), ] %>%
filter(year >= 1980) %>%
select(-positive, -negative)
#Heatmap for sentiments
rnames <- sentiments_emotions[,1]
mat_sentiments <- data.matrix(sentiments_emotions[,2:ncol(sentiments_emotions)])
rownames(mat_sentiments) <- rnames
mat_sentiments <- t(mat_sentiments)
df_sentiment <- as.data.frame(mat_sentiments)
names_emotions <- c("anger", "anticipation", "disgust","fear","joy","sadness","surprise","trust")
sentiments_graph <- cbind(names_emotions, df_sentiment)
#Heatmap
heatmap.2(mat_sentiments, Rowv=NA, Colv=NA, scale="row", col=colorRampPalette(c("white","darkblue")), margins=c(5,10), trace = "none")
What is immediately visible is that the differences between the types of emotions in plot keywords have reduced over the years.
Below the trend of positive and negative sentiment is explored.
#Positive and negative sentiment by year (1980 onwards)
sentiments_filter <- sentiments %>%
filter(year >= 1980)
sentiments_filter %>%
ggplot(aes(x = year)) +
geom_line(aes(y = positive, color = 'positive')) +
geom_line(aes(y = negative, color = 'negative')) +
ylab("Sentiment score") +
xlab("Year") +
ggtitle("Positive and negative sentiment of keywords across the years")
Generally speaking, the sentiment of common keywords has been more positive than negative and has remained relatively constant, apart from a decrease in both positive and negative sentiment from around 1995 to 2010.
Network Analysis
pacman::p_load(stringr,ggplot2, tidyr, ngram, dplyr, igraph, ggraph, visNetwork, tidygraph, graphlayouts,ggpubr, ggrepel, ggridges, viridis, network, reshape, tidytext)
setwd(getwd())
options(scipen = 999)
Constructing the network graph
Actors will be the nodes. Edges exist only if the actors have appeared in a movie together.
##extract the actors
actors <- imdb %>%
select(actor_1_name, actor_2_name, actor_3_name)
head(actors, 5)## actor_1_name actor_2_name actor_3_name
## 1 CCH Pounder Joel David Moore Wes Studi
## 2 Johnny Depp Orlando Bloom Jack Davenport
## 3 Christoph Waltz Rory Kinnear Stephanie Sigman
## 4 Tom Hardy Christian Bale Joseph Gordon-Levitt
## 5 Doug Walker Rob Walker
actors <- actors %>%
filter(actor_1_name != "") %>%
filter(actor_2_name != "") %>%
filter(actor_3_name != "")
The nodelist will only contain each actor’s name once.
##make the nodelist
actor_nodes <- actors %>%
gather() %>%
select(value) %>%
distinct(value)
head(actor_nodes, 5)## value
## 1 CCH Pounder
## 2 Johnny Depp
## 3 Christoph Waltz
## 4 Tom Hardy
## 5 Daryl Sabara
Because each movie has three top actors given, some column manipulation is needed to format the data into the two “to” and “from” columns required for the edgelist.
##make the edgelist for actor_1 and actor_2
temp_edges_1_2 <- actors %>%
select(actor_1_name, actor_2_name) %>%
na.omit() %>%
dplyr::rename(from = actor_1_name, to = actor_2_name)
temp_edges_1_2[temp_edges_1_2==""] <- NA
temp_edges_1_2[temp_edges_1_2==" "] <- NA
temp_edges_1_2 <- temp_edges_1_2 %>%
na.omit()
head(temp_edges_1_2, 5)## from to
## 1 CCH Pounder Joel David Moore
## 2 Johnny Depp Orlando Bloom
## 3 Christoph Waltz Rory Kinnear
## 4 Tom Hardy Christian Bale
## 5 Daryl Sabara Samantha Morton
## edgelist for actor_1 and actor_3
temp_edges_1_3 <- actors %>%
select(actor_1_name, actor_3_name) %>%
na.omit() %>%
dplyr::rename(from = actor_1_name) %>%
dplyr::rename(to = actor_3_name)
temp_edges_1_3[temp_edges_1_3==""] <- NA
temp_edges_1_3[temp_edges_1_3==" "] <- NA
##remove both values if there is even one NA present, eg Tom Hardy -> NA
temp_edges_1_3 <- temp_edges_1_3%>%
na.omit()
head(temp_edges_1_3, 5)## from to
## 1 CCH Pounder Wes Studi
## 2 Johnny Depp Jack Davenport
## 3 Christoph Waltz Stephanie Sigman
## 4 Tom Hardy Joseph Gordon-Levitt
## 5 Daryl Sabara Polly Walker
## edgelist for actor_2 and actor_3
temp_edges_2_3 <- actors %>%
select(actor_2_name, actor_3_name) %>%
na.omit() %>%
dplyr::rename(from = actor_2_name) %>%
dplyr::rename(to = actor_3_name)
temp_edges_2_3[temp_edges_2_3==""] <- NA
temp_edges_2_3[temp_edges_2_3==" "] <- NA
temp_edges_2_3 <- temp_edges_2_3 %>%
na.omit()
head(temp_edges_2_3, 5)## from to
## 1 Joel David Moore Wes Studi
## 2 Orlando Bloom Jack Davenport
## 3 Rory Kinnear Stephanie Sigman
## 4 Christian Bale Joseph Gordon-Levitt
## 5 Samantha Morton Polly Walker
##Combine the three sets of edges
actor_edges <- do.call("rbind", list(temp_edges_1_2, temp_edges_1_3, temp_edges_2_3))
temp_edges_1_2 = NULL
temp_edges_1_3 = NULL
temp_edges_2_3 = NULL
head(actor_edges)## from to
## 1 CCH Pounder Joel David Moore
## 2 Johnny Depp Orlando Bloom
## 3 Christoph Waltz Rory Kinnear
## 4 Tom Hardy Christian Bale
## 5 Daryl Sabara Samantha Morton
## 6 J.K. Simmons James Franco
Here is a simple network plot of the constructed network.
A densely connected core can be seen, surrounded by many small nodes that are not connected to the main component.
##write to graphml for Gephi purposes
write.graph(actors_in_same_movies, "../data-out/graphs/actors_in_same_movies.graphml", format=c("graphml"))
Network measures
Eigenvector centrality (also called eigencentrality) is a measure of the influence of a node in a network. It assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.
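The definition can be checked on a toy undirected graph via power iteration in base R (graph and scores purely illustrative):

```r
# Toy 4-node graph: node 1 connects to 2, 3, 4; nodes 2 and 3 also connect
A <- matrix(c(0, 1, 1, 1,
              1, 0, 1, 0,
              1, 1, 0, 0,
              1, 0, 0, 0), 4, 4, byrow = TRUE)
x <- rep(1, 4)
for (i in 1:100) {          # power iteration converges to the leading eigenvector
  x <- A %*% x
  x <- x / sqrt(sum(x^2))   # renormalise each step
}
round(as.vector(x), 2)      # node 1 scores highest; node 4 scores lowest,
                            # since its only link is to node 1
```

Nodes 2 and 3 end up with identical scores: they have the same connections, so eigencentrality treats them identically.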
e_values <- readRDS(file="../../g_eigen_values.RDs")
projected <- projected %>%
set_vertex_attr(name = "g_e_values", value = e_values$values)## Warning in vattrs[[name]][index] <- value: number of items to replace is
## not a multiple of replacement length
e_values['values'] %>%
as.data.frame() %>%
ggplot()+
geom_density(aes(values)) +
xlab("Eigen value")
The eigencentrality values are heavily skewed towards low values across the graph. This means that most nodes are not connected to high-scoring nodes.
In a connected graph, the normalized closeness centrality (or closeness) of a node is the reciprocal of the average shortest-path length between the node and all other nodes in the graph. Thus the more central a node is, the closer it is to all other nodes (adapted from Wikipedia).
An actor will be well connected if many other actors can be reached in a small number of hops.
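For example, on the 3-node path graph a–b–c, closeness (in igraph's unnormalised form, the reciprocal of the sum of shortest-path distances) works out as follows:

```r
# Shortest-path distances on the path graph a - b - c, filled in by hand
D <- matrix(c(0, 1, 2,
              1, 0, 1,
              2, 1, 0), 3, 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
closeness <- 1 / rowSums(D)  # igraph::closeness() uses this unnormalised form
closeness                    # b is most central: 1/2, versus 1/3 for a and c
```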
close_cent <- projected %>%
as_tbl_graph() %>%
activate(nodes) %>%
igraph::closeness()
var_cc <- mean(close_cent)
projected <- projected %>%
set_vertex_attr(name = "g_close_cent", value = close_cent)
close_cent %>%
as.data.frame() %>%
dplyr::rename(closeness = '.') %>%
ggplot()+
geom_density(aes(closeness)) +
xlab("Centrality (Closeness)") +
ylab("Frequency")
The closeness distribution is very interesting. There are many nodes with relatively high closeness and many with relatively low closeness. This is because the graph has many small components and one very densely connected large component. The mean value is 0.0000001.
Interpretively, the Bonacich power measure corresponds to the notion that the power of a vertex is recursively defined by the sum of the power of its alters. The nature of the recursion is controlled by the power exponent: positive values imply that vertices become more powerful as their alters become more powerful (as occurs in cooperative relations), while negative values imply that vertices become more powerful only as their alters become weaker (as occurs in competitive or antagonistic relations) (adapted from Wikipedia).
Essentially, the importance of an actor is defined by the importance of its alters, i.e. the other actors it is connected to.
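In matrix form the Bonacich measure is c(β) = (I − βA)⁻¹A1, with the scaling constant α omitted here; a base-R check on a toy 4-node path graph:

```r
# Bonacich power on the path graph 1 - 2 - 3 - 4 (toy example)
A <- matrix(0, 4, 4)
A[cbind(1:3, 2:4)] <- 1
A <- A + t(A)                         # symmetric adjacency matrix of the path
beta <- 0.2                           # positive beta: power grows with powerful alters
ones <- rep(1, 4)
p <- solve(diag(4) - beta * A) %*% (A %*% ones)
round(as.vector(p), 2)                # the middle nodes (2, 3) come out most powerful
```

With a positive exponent, as used in the chunk below, well-connected nodes whose neighbours are themselves well connected score highest.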
power_cent <- projected %>% power_centrality(exponent = 0.9)
var_pc <- max(power_cent)
projected <- projected %>%
set_vertex_attr(name = "g_power_cent", value = power_cent)
power_cent %>%
as.data.frame() %>%
dplyr::rename(power = '.') %>%
mutate(power = as.numeric(power)) %>%
ggplot() +
geom_density(aes(x=power))
The distribution of Bonacich power is slightly positively skewed, meaning that in general, vertices are considered more ‘powerful’ as their alters increase in power. The maximum power centrality is 14.7905526.
The PageRank algorithm measures the importance of a node via random walks over the network: the more likely an actor is to be reached when randomly traversing co-starring links, the higher the assigned PageRank.
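The idea can be sketched with a hand-rolled power iteration on a tiny directed toy graph (damping factor 0.85, igraph's default for page_rank()):

```r
# Out-links per node of a hypothetical 3-node graph: 1 -> {2,3}, 2 -> 3, 3 -> 1
links <- list(c(2, 3), 3, 1)
n <- 3; d <- 0.85
pr <- rep(1/n, n)                       # start from a uniform distribution
for (it in 1:100) {
  new <- rep((1 - d) / n, n)            # random-jump term
  for (i in 1:n)
    for (j in links[[i]])
      new[j] <- new[j] + d * pr[i] / length(links[[i]])  # share rank over out-links
  pr <- new
}
round(pr, 3)  # scores sum to 1; node 3, which receives the most links, ranks highest
```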
page_ranks <- projected %>%
page_rank()
var_pr <- mean(page_ranks$vector)
projected <- projected %>%
set_vertex_attr(name = "g_page_rank", value = page_ranks$vector)
page_ranks$vector %>%
as.data.frame() %>%
dplyr::rename(page_r = '.') %>%
mutate(page_r = as.numeric(page_r)) %>%
ggplot() +
geom_density(aes(x=page_r)) +
xlab("Page Rank")
Most nodes have a relatively low page rank. The mean page rank is 0.0001859.
Community Detection
The Louvain algorithm (group_louvain()) optimises for modularity and therefore tries to create densely connected clusters with sparse connections between them. The densely connected core may prevent very modular communities from being identified.
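Modularity itself is easy to compute by hand: Q = Σ_c (e_c/m − (d_c/2m)²), where e_c is the number of edges inside community c, d_c the total degree of its members, and m the total edge count. A base-R check on a toy graph of two triangles joined by a bridge:

```r
# Two triangles (1,2,3) and (4,5,6) joined by the bridge edge 3-4
edges <- rbind(c(1, 2), c(1, 3), c(2, 3),   # triangle A
               c(4, 5), c(4, 6), c(5, 6),   # triangle B
               c(3, 4))                      # bridge
m <- nrow(edges)
comm <- c(1, 1, 1, 2, 2, 2)                  # the "natural" two-community partition
inside <- comm[edges[, 1]] == comm[edges[, 2]]
e_c <- tapply(inside, comm[edges[, 1]], sum) # intra-community edge counts
deg <- tabulate(c(edges), nbins = 6)         # node degrees
d_c <- tapply(deg, comm, sum)                # total degree per community
Q <- sum(e_c / m - (d_c / (2 * m))^2)
Q  # about 0.357: clearly modular
```

Louvain greedily searches for the partition maximising Q; in the densely connected core of the actor graph, even the best partition may only achieve a modest Q.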
node_comms <- as_tbl_graph(projected) %>%
activate(nodes) %>%
mutate(global_comm = group_louvain(weights = weight)) %>%
as.data.frame()
projected <- projected %>%
set_vertex_attr("comm", value = node_comms$global_comm)
Distribution of community size
node_comms <- projected %>%
as_tbl_graph() %>%
as.data.frame()
node_comms %>%
ggplot() +
geom_bar(aes(x=comm))+
scale_y_log10()+
NULL
This graph shows the distribution of community size. The community size is exponentially distributed, resulting in a few large communities and many smaller ones. Some form of filtering on community size is needed to remove the smaller communities.
node_comms %>%
filter(comm < 55) %>%
ggplot() +
geom_bar(aes(x=comm))+
geom_vline(xintercept=c(17.5), linetype="dotted")+
NULL
After removing communities smaller than 100 actors, only 17 communities remain.
node_comms_filtered <- node_comms %>%
filter(comm <= 53)
projected <- set_vertex_attr(projected, name = "Label", value = V(projected)$name)
Visualising the network
The 17 remaining communities were analysed in Gephi. The nodes are coloured and grouped by community, while the size of each node and its label depends on the node’s degree.
It is clear that the remaining communities are very densely connected meaning that even after optimising for modularity, actors have many connections outside their community. These dense connections may have negatively impacted the results of the Group Louvain and there is concern as to the true modularity of these communities.
top_comms_nodes <- node_comms %>%
add_count(comm) %>%
arrange(name, desc(n)) %>%
group_by(comm) %>%
distinct %>%
top_n(5) %>%
ungroup() %>%
arrange(n) %>%
filter(n > 100) %>%
arrange(desc(n)) %>%
na.omit()
top_comms_nodes## # A tibble: 3,890 x 7
## name g_e_values g_close_cent g_power_cent g_page_rank comm n
## <chr> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 Adam Brown 41.7 0.000000142 1.94 0.0000394 1 535
## 2 Adam Hicks -2.70 0.000000142 -0.0684 0.000130 1 535
## 3 Adrian Alo… -1. 0.000000142 -0.328 0.0000640 1 535
## 4 Adrian Les… 3.25 0.000000142 1.65 0.0000517 1 535
## 5 Aida Turtu… -1. 0.000000142 -1.09 0.000100 1 535
## 6 Aimee Teeg… 1.23 0.000000142 -0.304 0.000153 1 535
## 7 AJ Michalka 6.78 0.000000142 0.0789 0.0000823 1 535
## 8 Al Roker -1. 0.000000142 -0.475 0.0000559 1 535
## 9 Alan Young 4.94 0.000000142 0.232 0.0000745 1 535
## 10 Alastair D… -2.25 0.000000142 0.611 0.0000594 1 535
## # … with 3,880 more rows
Analysis of Global Comm 1
g_comm_1_nodes <- top_comms_nodes %>%
filter(comm == 1)
g_comm_1_graph <- subgraph(projected, g_comm_1_nodes$name)
e_values_1 <- g_comm_1_graph %>%
as_adjacency_matrix(type="both") %>%
eigen()
g_comm_1_graph <- g_comm_1_graph %>%
set_vertex_attr(name = "local_e_values", value = e_values_1$values)
e_values_1['values'] %>%
as.data.frame() %>%
ggplot()+
geom_density(aes(values)) +
xlab("Eigen value")
Eigen values are positively skewed for global community 1. This means that most nodes are not connected to high scoring nodes.
close_cent <- g_comm_1_graph %>%
closeness()
var <- mean(close_cent)
g_comm_1_graph <- g_comm_1_graph %>%
set_vertex_attr(name = "local_close_cent", value = close_cent)
close_cent %>%
as.data.frame() %>%
dplyr::rename(closeness = '.') %>%
ggplot()+
geom_density(aes(closeness)) +
xlab("Closeness centrality")
The average closeness centrality is 0.0004944. When looking at a single community we expect a higher average closeness than when calculating for the whole graph which was 0.0000001.
power_cent <- g_comm_1_graph %>% power_centrality(exponent = 0.9)
var <- max(power_cent)
g_comm_1_graph <- g_comm_1_graph %>%
set_vertex_attr(name = "local_power_cent", value = power_cent)
power_cent %>%
as.data.frame() %>%
dplyr::rename(power = '.') %>%
mutate(power = as.numeric(power)) %>%
ggplot() +
geom_density(aes(x=power))
The maximum Bonacich power centrality is 6.653664.
page_ranks <- g_comm_1_graph %>%
page_rank()
var <- mean(page_ranks$vector)
g_comm_1_graph <- g_comm_1_graph %>%
set_vertex_attr(name = "local_page_rank", value = page_ranks$vector)
page_ranks$vector %>%
as.data.frame() %>%
dplyr::rename(page_r = '.') %>%
mutate(page_r = as.numeric(page_r)) %>%
ggplot() +
geom_density(aes(x=page_r)) +
xlab("Page Rank")
The local mean page rank is 0.0018692, compared to the global mean of 0.0001859.
Creating the graph of centrality measures for community 1.
attributes <- vertex_attr(g_comm_1_graph)
g_comm_1_nodes <- data.frame(name = attributes['name'], comm = attributes['comm'], g_e_values = attributes['g_e_values'], g_close_cent = attributes['g_close_cent'], g_page_rank = attributes['g_page_rank'], g_power_cent = attributes['g_power_cent'], local_e_values = attributes['local_e_values'], local_page_rank = attributes['local_page_rank'], local_power_cent = attributes['local_power_cent'], local_close_cent = attributes['local_close_cent'], stringsAsFactors = FALSE)
Analysis of Global Comm 2
This is the creation of the subgraph that will only contain vertices listed in community 2.
g_comm_2_nodes <- top_comms_nodes %>%
filter(comm == 2)
g_comm_2_graph <- subgraph(projected, g_comm_2_nodes$name)
e_values_2 <- g_comm_2_graph %>%
as_adjacency_matrix(type="both") %>%
eigen()
g_comm_2_graph <- g_comm_2_graph %>%
set_vertex_attr(name = "local_e_values", value = e_values_2$values)
e_values_2['values'] %>%
as.data.frame() %>%
ggplot()+
geom_density(aes(values)) +
xlab("Eigen value")
Eigen values are positively skewed for global community 2. This means that most nodes are not connected to high scoring nodes.
##Comm Centrality
##Closeness
close_cent <- g_comm_2_graph %>%
closeness()
var <- mean(close_cent)
g_comm_2_graph <- g_comm_2_graph %>%
set_vertex_attr(name = "local_close_cent", value = close_cent)
close_cent %>%
as.data.frame() %>%
dplyr::rename(closeness = '.') %>%
ggplot()+
geom_density(aes(closeness)) +
xlab("Closeness centrality")
The mean closeness centrality is 0.0008507.
power_cent <- g_comm_2_graph %>% power_centrality(exponent = 0.9)
var <- mean(power_cent)
g_comm_2_graph <- g_comm_2_graph %>%
set_vertex_attr(name = "local_power_cent", value = power_cent)
power_cent %>%
as.data.frame() %>%
dplyr::rename(power = '.') %>%
mutate(power = as.numeric(power)) %>%
ggplot() +
geom_density(aes(x=power))
The mean Bonacich power centrality is -0.2881859.
##Page Rank
page_ranks <- g_comm_2_graph %>%
page_rank()
var <- mean(page_ranks$vector)
g_comm_2_graph <- g_comm_2_graph %>%
set_vertex_attr(name = "local_page_rank", value = page_ranks$vector)
page_ranks$vector %>%
as.data.frame() %>%
dplyr::rename(page_r = '.') %>%
mutate(page_r = as.numeric(page_r)) %>%
ggplot() +
geom_density(aes(x=page_r)) +
xlab("Page Rank")
The mean page rank for community 2 is 0.0021645.
Creating the graph of centrality measures for community 2.
attributes <- vertex_attr(g_comm_2_graph)
g_comm_2_nodes <- data.frame(name = attributes['name'], comm = attributes['comm'], g_e_values = attributes['g_e_values'], g_close_cent = attributes['g_close_cent'], g_page_rank = attributes['g_page_rank'], g_power_cent = attributes['g_power_cent'], local_e_values = attributes['local_e_values'], local_page_rank = attributes['local_page_rank'], local_power_cent = attributes['local_power_cent'], local_close_cent = attributes['local_close_cent'], stringsAsFactors = FALSE)
Analysis of Global Comm 3
This is the creation of the subgraph that will only contain vertices listed in community 3.
g_comm_3_nodes <- top_comms_nodes %>%
filter(comm == 3)
g_comm_3_graph <- subgraph(projected, g_comm_3_nodes$name)
e_values_3 <- g_comm_3_graph %>%
as_adjacency_matrix(type="both") %>%
eigen()
g_comm_3_graph <- g_comm_3_graph %>%
set_vertex_attr(name = "local_e_values", value = e_values_3$values)
e_values_3['values'] %>%
as.data.frame() %>%
ggplot()+
geom_density(aes(values)) +
xlab("Eigen value")
Eigen values are negatively skewed for global community 3. This means that most nodes are not connected to high scoring nodes.
close_cent <- g_comm_3_graph %>%
closeness()
var <- mean(close_cent)
g_comm_3_graph <- g_comm_3_graph %>%
set_vertex_attr(name = "local_close_cent", value = close_cent)
close_cent %>%
as.data.frame() %>%
dplyr::rename(closeness = '.') %>%
ggplot()+
geom_density(aes(closeness)) +
xlab("Closeness centrality")
The mean closeness centrality is 0.0007851.
power_cent <- g_comm_3_graph %>% power_centrality(exponent = 0.9)
var <- mean(power_cent)
g_comm_3_graph <- g_comm_3_graph %>%
set_vertex_attr(name = "local_power_cent", value = power_cent)
power_cent %>%
as.data.frame() %>%
dplyr::rename(power = '.') %>%
mutate(power = as.numeric(power)) %>%
ggplot() +
geom_density(aes(x=power))
The mean Bonacich power centrality in community 3 is -0.0918364.
page_ranks <- g_comm_3_graph %>%
page_rank()
var <- mean(page_ranks$vector)
g_comm_3_graph <- g_comm_3_graph %>%
set_vertex_attr(name = "local_page_rank", value = page_ranks$vector)
page_ranks$vector %>%
as.data.frame() %>%
dplyr::rename(page_r = '.') %>%
mutate(page_r = as.numeric(page_r)) %>%
ggplot() +
geom_density(aes(x=page_r)) +
xlab("Page Rank")
The mean page rank is 0.0022422.
Creating the graph of centrality measures for community 3.
attributes <- vertex_attr(g_comm_3_graph)
g_comm_3_nodes <- data.frame(name = attributes['name'], comm = attributes['comm'], g_e_values = attributes['g_e_values'], g_close_cent = attributes['g_close_cent'], g_page_rank = attributes['g_page_rank'], g_power_cent = attributes['g_power_cent'], local_e_values = attributes['local_e_values'], local_page_rank = attributes['local_page_rank'], local_power_cent = attributes['local_power_cent'], local_close_cent = attributes['local_close_cent'], stringsAsFactors = FALSE)
Highlighting nodes using different measures of importance
The top nodes from selected communities will be compared to see which measure is the best indicator of higher ratings.
Overall measures
Highest degree
## [1] "Morgan Freeman"
Morgan Freeman has the highest degree of any node in the graph and could therefore be seen as an influential node, though not necessarily a central one. According to the movies in this dataset, Morgan Freeman has acted alongside the greatest number of distinct actors.
Closeness centrality
## [1] "Highest global closeness centrality: Morgan Freeman"
Due to the high degree, it is not surprising that Morgan Freeman has the highest level of closeness centrality across the graph.
Highest Page Ranking
## [1] "Highest global Page rank: Morgan Freeman"
Morgan Freeman is considered the most important node by the PageRank algorithm.
Highest Bonacich Power centrality
## [1] "Highest global Boncich power centrality: Matt Keeslar"
This actor is not himself considered the most influential; however, he has the most influential connections.
Community 1
Highest degree
## [1] "Rupert Everett"
Closeness centrality
## [1] "Highest global closeness centrality: Nathan Lane"
## [1] "Highest local closeness centrality: Jim Belushi"
Highest Page Ranking
## [1] "Highest global page rank: Richard Schiff"
## [1] "Highest local page rank: Richard Schiff"
Highest Bonacich Power centrality
## [1] "Highest global power centrality: Michael McGlone"
## [1] "Highest local power centrality: William Sanderson"
Community 2
Highest degree
## [1] "Vanessa Redgrave"
Closeness centrality
## [1] "Highest global closeness centrality: Robert Duvall"
## [1] "Highest global closeness centrality: Vanessa Redgrave"
Highest Page Ranking
## [1] "Highest global Page rank: Robert Duvall"
## [1] "Highest local Page rank: Vanessa Redgrave"
Highest Bonacich Power centrality
## [1] "Highest global Boncich power centrality: Tabu"
## [1] "Highest local Boncich power centrality: Dan Fogler"
The actors Tabu and Dan Fogler have the highest global and local Bonacich power centrality, respectively. This means that, across the whole graph (but limited to vertices in community 2), Tabu has the most powerful connections, while Dan Fogler has the most powerful connections within community 2.
Community 3
Highest degree
## [1] "Eddie Marsan"
Closeness centrality
## [1] "Highest global closeness centrality: Steve Coogan"
## [1] "Highest local closeness centrality: Eddie Marsan"
Highest Page Ranking
## [1] "Highest global Page rank: Jon Favreau"
## [1] "Highest local Page rank: Eddie Marsan"
Highest Bonacich Power centrality
## [1] "Highest global Boncich power centrality: Carter Jenkins"
## [1] "Highest local Boncich power centrality: Matthew R. Anderson"
The actors Carter Jenkins and Matthew R. Anderson have the highest global and local Bonacich power centrality, respectively. This means that, across the whole graph (but limited to vertices in community 3), Carter Jenkins has the most powerful connections, while Matthew R. Anderson has the most powerful connections within community 3.
Analysis of centrality measures and ratings
Here the average rating of the movies each actor has starred in is calculated.
# Average IMDb score per first-billed actor (actor_1_name)
average_imdb_actor_ratings <- imdb %>%
  select(imdb_score, actor_1_name) %>%
  dplyr::rename(actor = actor_1_name) %>%
  group_by(actor) %>%
  mutate(avg_rating = mean(imdb_score)) %>%
  select(-imdb_score) %>%
  distinct(actor, .keep_all = TRUE) %>%
  na.omit()

# The same, for the second-billed actor column
temp2 <- imdb %>%
  select(imdb_score, actor_2_name) %>%
  dplyr::rename(actor = actor_2_name) %>%
  group_by(actor) %>%
  mutate(avg_rating = mean(imdb_score)) %>%
  select(-imdb_score) %>%
  distinct(actor, .keep_all = TRUE) %>%
  na.omit()

# ...and the third-billed actor column
temp3 <- imdb %>%
  select(imdb_score, actor_3_name) %>%
  dplyr::rename(actor = actor_3_name) %>%
  group_by(actor) %>%
  mutate(avg_rating = mean(imdb_score)) %>%
  select(-imdb_score) %>%
  distinct(actor, .keep_all = TRUE) %>%
  na.omit()

# Merge the three per-slot tables; an actor appearing in more than one
# billing slot ends up with the mean of their per-slot averages.
average_imdb_actor_ratings <- full_join(average_imdb_actor_ratings, temp2) %>%
  group_by(actor) %>%
  summarise(avg_rating = mean(avg_rating))
## Joining, by = c("actor", "avg_rating")
average_imdb_actor_ratings <- full_join(average_imdb_actor_ratings, temp3) %>%
  group_by(actor) %>%
  summarise(avg_rating = mean(avg_rating))
## Joining, by = c("actor", "avg_rating")
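As an aside, the three near-identical blocks above can be collapsed into one pass with tidyr::pivot_longer. A sketch on a toy stand-in for the imdb data frame (column names taken from the code above; values hypothetical). Note that this version weights every appearance equally, whereas the join-then-average approach above takes a mean of per-slot means:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the real imdb data frame.
imdb_toy <- tibble(
  imdb_score   = c(8, 6),
  actor_1_name = c("X", "Y"),
  actor_2_name = c("Y", "Z"),
  actor_3_name = c(NA, "X")
)

avg_ratings <- imdb_toy %>%
  pivot_longer(starts_with("actor_"), values_to = "actor") %>%  # one row per billing slot
  filter(!is.na(actor)) %>%
  group_by(actor) %>%
  summarise(avg_rating = mean(imdb_score), .groups = "drop")

avg_ratings  # X and Y average to 7, Z to 6
```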
Overall
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Morgan Freeman 7.76
Morgan Freeman has an average movie rating of 7.7605. In terms of the overall graph, this actor has the highest degree, closeness centrality and page rank.
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Matt Keeslar 7
Matt Keeslar has an average movie rating of 7. In terms of the overall graph, this actor has the highest Bonacich centrality, meaning he has very influential alters.
Community 1
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Tom Wilkinson 7.08
Tom Wilkinson has an average movie rating of 7.077083. In terms of community 1, this actor has the highest degree, the highest global closeness centrality, and both the highest global and local page rank.
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Miranda Richardson 6.86
Miranda Richardson has an average movie rating of 6.855. In terms of community 1, this actor has the highest local closeness centrality meaning she is very central within community 1 but not overall in the graph.
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 R. Marcos Taylor 7.9
R. Marcos Taylor has an average movie rating of 7.9. In terms of community 1, this actor has the highest global Bonacich centrality, meaning that across the graph he has influential alters.
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Eric Sykes 7.6
Eric Sykes has an average movie rating of 7.6. In terms of community 1, this actor has the highest local Bonacich centrality, meaning that, within community 1 alone, Eric Sykes has the most influential alters.
Community 2
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Scarlett Johansson 7.52
Scarlett Johansson has an average movie rating of 7.522159. In terms of community 2, this actor has the highest degree.
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Kristin Scott Thomas 6.94
Kristin Scott Thomas has an average movie rating of 6.939583. In terms of community 2, this actor has the highest global closeness centrality meaning she is very central overall in the graph but not the most central if only looking at community 2.
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Rachael Harris 6.21
Rachael Harris has an average movie rating of 6.208333. In terms of community 2, this actor has the highest local closeness centrality meaning she is very central within community 2 but not overall in the graph.
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Steve Coogan 6.29
Steve Coogan has an average movie rating of 6.2875. In terms of community 2, this actor has the highest global page rank.
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Richard Schiff 6.14
Richard Schiff has an average movie rating of 6.143333. In terms of community 2, this actor has the highest local page rank.
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Gary Coleman 6.15
Gary Coleman has an average movie rating of 6.15. In terms of community 2, this actor has the highest global Bonacich centrality and has influential alters across the network.
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Pamela Anderson 5.5
Pamela Anderson has an average movie rating of 5.5. In terms of community 2, this actor has the highest local Bonacich centrality and has influential alters within community 2.
Community 3
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Morgan Freeman 7.76
Morgan Freeman has an average movie rating of 7.7605. In terms of community 3, this actor has the highest degree, closeness centrality and page rank, both locally and globally.
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Tabu 7.8
Tabu acts primarily in Hindi films and is the only highlighted actor not from Western cinema. Tabu's average movie rating is 7.8 and, in terms of community 3, Tabu has the highest global Bonacich centrality.
## # A tibble: 1 x 2
## actor avg_rating
## <chr> <dbl>
## 1 Charlize Theron 6.59
Charlize Theron has an average movie rating of 6.586667 and has the highest local Bonacich power centrality within community 3.
Centrality measures and ratings
## Joining, by = "name"
## Joining, by = "name"
## Joining, by = "name"
ratings_and_centrality %>%
  ggplot() +
  geom_smooth(aes(x = g_e_values, y = avg_rating), method = 'loess') +
  xlab("Global Eigenvector centrality") +
  ylab("Avg movie rating") +
  ggtitle("Global Eigenvector Centrality vs Average Movie rating")

ratings_and_centrality %>%
  ggplot() +
  geom_smooth(aes(x = g_close_cent, y = avg_rating), method = 'loess') +
  xlab("Global Closeness centrality values") +
  ylab("Avg movie rating") +
  ggtitle("Global Closeness Centrality vs Average Movie rating")

ratings_and_centrality %>%
  ggplot() +
  geom_smooth(aes(x = g_page_rank, y = avg_rating), method = 'loess') +
  xlab("Global Page Rank values") +
  ylab("Avg movie rating") +
  ggtitle("Global Page Rank vs Average Movie rating")

ratings_and_centrality %>%
  ggplot() +
  geom_smooth(aes(x = g_power_cent, y = avg_rating), method = 'loess') +
  xlab("Global Power Centrality") +
  ylab("Avg movie rating") +
  ggtitle("Global Bonacich Power Centrality vs Average Movie rating")
None of the plots show any strong relationship between the global centrality measures and the average movie rating. We now explore whether the local centrality measures produce a different outcome.
ratings_and_centrality %>%
  ggplot() +
  geom_smooth(aes(x = local_e_values, y = avg_rating), method = 'loess') +
  xlab("Eigenvector Centrality") +
  ylab("Avg movie rating") +
  ggtitle("Local Eigenvector Centrality vs Average Movie rating")

ratings_and_centrality %>%
  ggplot() +
  geom_smooth(aes(x = local_close_cent, y = avg_rating), method = 'loess') +
  xlab("Closeness Centrality") +
  ylab("Avg movie rating") +
  ggtitle("Local Closeness Centrality vs Average Movie rating")

ratings_and_centrality %>%
  ggplot() +
  geom_smooth(aes(x = local_page_rank, y = avg_rating), method = 'loess') +
  xlab("Page rank") +
  ylab("Avg movie rating") +
  ggtitle("Local Page rank vs Average Movie rating")

ratings_and_centrality %>%
  ggplot() +
  geom_smooth(aes(x = local_power_cent, y = avg_rating), method = 'loess') +
  xlab("Power Centrality") +
  ylab("Avg movie rating") +
  ggtitle("Local Bonacich Power Centrality vs Average Movie rating")
The local centralities likewise show no apparent correlation with the average movie rating.
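Rather than relying on the smoothed curves alone, the impression of "no correlation" could be quantified with a rank correlation. A sketch on simulated data standing in for ratings_and_centrality (column names assumed from the plots above; values hypothetical):

```r
# Simulated stand-in for ratings_and_centrality.
set.seed(1)
ratings_toy <- data.frame(
  avg_rating      = runif(200, min = 4, max = 9),
  local_page_rank = runif(200)
)

# Spearman's rho is robust to the heavy skew typical of centrality
# scores; a value near 0 supports the "no correlation" reading.
rho <- cor(ratings_toy$local_page_rank, ratings_toy$avg_rating,
           method = "spearman")
rho
```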
ratings_and_centrality %>%
  group_by(comm) %>%
  summarise(avg_rating = mean(avg_rating),
            avg_e = mean(g_e_values),
            avg_c_c = mean(g_close_cent),
            avg_p_r = mean(g_page_rank),
            avg_p_c = mean(g_power_cent),
            num = n())
## # A tibble: 3 x 7
## comm avg_rating avg_e avg_c_c avg_p_r avg_p_c num
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 6.26 -0.225 0.000000142 0.000122 0.0292 535
## 2 2 6.53 1.01 0.000000142 0.000219 -0.0287 462
## 3 3 6.45 1.69 0.000000142 0.000185 -0.0677 446
Across the three communities, the mean centrality scores remain roughly constant, with the exception of the eigenvector centrality of community 1.
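Whether the small differences in mean rating between communities are meaningful could be tested formally, for instance with a Kruskal-Wallis test. A sketch on simulated data whose group sizes and means mirror the table above (the real ratings_and_centrality is assumed from earlier chunks):

```r
# Simulated ratings for three communities, sized and centred like the table.
set.seed(42)
ratings_toy <- data.frame(
  comm       = factor(rep(1:3, times = c(535, 462, 446))),
  avg_rating = c(rnorm(535, mean = 6.26, sd = 1.1),
                 rnorm(462, mean = 6.53, sd = 1.1),
                 rnorm(446, mean = 6.45, sd = 1.1))
)

# Nonparametric test for a difference in rating distributions across
# the three communities.
kruskal.test(avg_rating ~ comm, data = ratings_toy)
```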
Concluding remarks
Node centrality does not appear to be an indicator of a movie's rating: the variance in ratings is high for highly central nodes as well as for less central ones. There is also some doubt about the reliability of the detected communities, given how densely connected the graph is. The